Selectivity Estimation for Fuzzy String Predicates in Large Data Sets
Published in: Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005
Supported by NSF CAREER Award No. IIS-0238586.
Abstract
Many database applications increasingly need to support fuzzy queries that ask for strings similar to a given string, such as “name similar to smith” and “telephone number similar to 412-0964.” Query optimization needs the selectivity of such a fuzzy predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of fuzzy string predicates. We develop a novel technique, called Sepia, to solve the problem. It groups strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram for the database. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the same cluster, the proximity between q and p and the proximity between p and s let us obtain, from the global histogram, a probability distribution for the similarity between q and s. We give a full specification of the technique using the edit distance function. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them for selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the technique to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of fuzzy string predicates.
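To make the quantity being estimated concrete, here is a minimal Python sketch of the exact selectivity of an edit-distance predicate, computed by brute force. It is not the Sepia estimator: it is only the ground truth that an estimator of this kind would approximate without scanning every string. The function names, the sample list `names`, and the threshold are illustrative assumptions, not taken from the paper. The helper `distance_bounds` shows only the triangle-inequality bound on the distance between q and s given the distances via a pivot p, which is a much weaker statement than the probability distribution the abstract describes.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def exact_selectivity(strings, query, threshold):
    """Fraction of strings within `threshold` edits of `query` (full scan)."""
    if not strings:
        return 0.0
    matches = sum(1 for s in strings if edit_distance(query, s) <= threshold)
    return matches / len(strings)


def distance_bounds(d_qp: int, d_ps: int):
    """Triangle-inequality bounds on d(q, s) given d(q, p) and d(p, s).

    Illustrative only: a hard bound, not the histogram-based
    distribution described in the abstract.
    """
    return abs(d_qp - d_ps), d_qp + d_ps


# Hypothetical sample data for illustration.
names = ["smith", "smyth", "smithe", "jones", "schmidt"]
print(exact_selectivity(names, "smith", 1))  # 3 of 5 strings match -> 0.6
```

The full scan costs one distance computation per record, which is exactly the cost a query optimizer cannot afford at planning time and the reason a compact summary structure is attractive.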
Similar resources
Topic-based Selectivity Estimation for Hybrid Queries over RDF Graphs
The Resource Description Framework (RDF) has become an accepted standard for describing entities on the Web. Many such RDF descriptions are text-rich – besides structured data, they also feature large portions of unstructured text. As a result, RDF data is frequently queried using predicates matching structured data, combined with string predicates for textual constraints: hybrid queries. Evalu...
CXHist: An On-line Classification-Based Histogram for XML String Selectivity Estimation
Query optimization in IBM’s System RX, the first truly relational-XML hybrid data management system, requires accurate selectivity estimation of path-value pairs, i.e., the number of nodes in the XML tree reachable by a given path with the given text value. Previous techniques have been inadequate, because they have focused mainly on the tag-labeled paths (tree structure) of the XML data. For m...
Multi-Dimensional Substring Selectivity Estimation
With the explosion of the Internet, LDAP directories and XML, there is an ever greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use multi-dimensional count-suffix trees as t...
Fuzzy Inference System Approach in Deterministic Seismic Hazard, Case Study: Qom Area, Iran
Seismic hazard assessment, like many other issues in seismology, is a complicated problem because a variety of parameters affect the occurrence of an earthquake. Uncertainty, which results from vagueness and incompleteness of the data, should be considered in a rational way. Fuzzy methods make it possible to take these uncertainties into account. Fuzzy inference system,...